Abstract

The linear coupling method was introduced recently by Allen-Zhu and Orecchia [14] for solving convex
optimization problems with first order methods, and it provides a conceptually simple way to integrate a
gradient descent step and mirror descent step in each iteration. In the setting of standard smooth convex
optimization, the method achieves the same convergence rate as that of the accelerated gradient descent method of
Nesterov [8]. The high-level approach of the linear coupling method is very flexible, and it has shown initial
promise by providing improved algorithms for packing and covering linear programs [1,2]. Somewhat
surprisingly, however, while the dependence of the convergence rate on the error parameter $\epsilon$ for packing problems
was improved to $\tilde{O}(1/\epsilon)$, which corresponds to what accelerated gradient methods are designed to achieve,
the dependence for covering problems was only improved to $\tilde{O}(1/\epsilon^{1.5})$, and even that required a different,
more complicated algorithm. Given the close connections between packing and covering problems and since
previous algorithms for these very related problems have led to the same $\epsilon$ dependence, this discrepancy is
surprising, and it leaves open the question of the exact role that the linear coupling is playing in coordinating
the complementary gradient and mirror descent step of the algorithm. In this paper, we clarify these issues
for linear coupling algorithms for packing and covering linear programs, illustrating that the linear coupling
method can lead to improved $\tilde{O}(1/\epsilon)$ dependence for both packing and covering problems in a unified
manner, i.e., with the same algorithm and almost identical analysis. Our main technical result is a novel diameter
reduction method for covering problems that is of independent interest and that may be useful in applying the
accelerated linear coupling method to other combinatorial problems.


1 Introduction 

A fractional covering problem, in its generic form, can be written as the following linear program (LP):

$$\min_{x \ge 0}\,\{c^T x : Ax \ge b\},$$

where $c \in \mathbb{R}^n_{\ge 0}$, $b \in \mathbb{R}^m_{\ge 0}$, and $A \in \mathbb{R}^{m \times n}_{\ge 0}$. That is, we want to put weights on the $x_i$'s, for $i \in \{1,\ldots,n\}$,
such that each $j \in \{1,\ldots,m\}$ is “covered” with weight at least $b_j$, where each unit of weight on $x_i$ puts $A_{ji}$
weight on each $j$, and we want to minimize the cost $c^T x$ in doing so. Without loss of generality, one can scale
the coefficients, in which case one can write this LP in the standard form:

$$\min_{x \ge 0}\,\{\mathbf{1}^T x : Ax \ge \mathbf{1}\}, \tag{1}$$

where $A \in \mathbb{R}^{m \times n}_{\ge 0}$. The dual of this LP, the fractional packing problem, can be written in this standard form as:

$$\max_{y \ge 0}\,\{\mathbf{1}^T y : Ay \le \mathbf{1}\}. \tag{2}$$

We denote by OPT the optimal value of the primal (1) (which is also the optimal value of the dual (2)). In this
case, we say that $x$ is a $(1+\epsilon)$-approximation for the covering LP if $Ax \ge \mathbf{1}$ and $\mathbf{1}^T x \le (1+\epsilon)\,\mathrm{OPT}$, and we
say that $y$ is a $(1-\epsilon)$-approximation for the packing LP if $Ay \le \mathbf{1}$ and $\mathbf{1}^T y \ge (1-\epsilon)\,\mathrm{OPT}$.
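The two approximation notions just defined can be checked mechanically. The following is a minimal sketch, assuming a dense list-of-rows matrix and a known value of OPT; the function names and the toy instance are illustrative and not part of the paper.

```python
from typing import List


def mat_vec(A: List[List[float]], v: List[float]) -> List[float]:
    """Return the matrix-vector product A v for a dense list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def is_covering_approx(A, x, opt, eps, tol=1e-12):
    """Check x >= 0, Ax >= 1, and 1^T x <= (1 + eps) * OPT."""
    feasible = min(x) >= -tol and min(mat_vec(A, x)) >= 1.0 - tol
    return feasible and sum(x) <= (1.0 + eps) * opt + tol


def is_packing_approx(A, y, opt, eps, tol=1e-12):
    """Check y >= 0, Ay <= 1, and 1^T y >= (1 - eps) * OPT."""
    feasible = min(y) >= -tol and max(mat_vec(A, y)) <= 1.0 + tol
    return feasible and sum(y) >= (1.0 - eps) * opt - tol


# Toy instance (assumed for illustration): A is the 2x2 identity, so OPT = 2.
A = [[1.0, 0.0], [0.0, 1.0]]
covering_ok = is_covering_approx(A, [1.05, 1.0], opt=2.0, eps=0.1)
packing_ok = is_packing_approx(A, [0.95, 1.0], opt=2.0, eps=0.1)
```

In practice OPT is unknown, so a check like this is only useful on instances where the optimum has been obtained by other means.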
Packing and covering problems are important classes of LPs with wide applications, and they have long
drawn interest in computer science and theoretical computer science. Although one can use general LP solvers
such as interior point methods to solve packing and covering with a convergence rate of $\log(1/\epsilon)$, such algorithms
usually have very high per-iteration cost, as operations such as the computation of the Hessian and matrix inversion
are involved. In the setting of large-scale problems, low precision iterative solvers are often more popular
choices. Such solvers usually run in time with a nearly-linear dependence on the problem size, and they have
$\mathrm{poly}(1/\epsilon)$ dependence on the approximation parameter. Most such work falls into one of two categories. The
first category follows the approach of transforming LPs to convex optimization problems, then applying efficient
first-order optimization algorithms. Examples of work in this category include [1-3,7,8,11], and all except [1,2]
apply to more general classes of LPs. The second category is based on the Lagrangian relaxation framework, and
some examples of work in this category include [4-6,10,12,13]. For a more detailed comparison of this prior
work, see Table 1 in [1]. Also, based on whether the running time depends on the width $\rho$, a parameter which
typically depends on the dimension and the largest entry of $A$, these algorithms can also be divided into
width-dependent solvers and width-independent solvers. Width-dependent solvers are usually pseudo-polynomial,
as the running time depends on $\rho\,\mathrm{OPT}$, which itself can be large, while width-independent solvers are more
efficient in the sense that they provide truly polynomial-time approximation solvers.

In this paper, we describe a solver for covering LPs of the form (1). The solver is width-independent (see footnote 1), and it
is a first-order method with a linear rate of convergence. That is, if we let $N$ be the number of non-zeros in $A$,
then the running time of our algorithm is at worst $O\big(N \log(N/\epsilon)\log(1/\epsilon)/\epsilon\big)$. To simplify the following discussion,
we will follow the standard practice of using $\tilde{O}$ to hide poly-log factors, in which case the running time of our
algorithm for the covering problem is at worst $\tilde{O}(N/\epsilon)$. Among other things, our result is an improvement over
the recent bound of $\tilde{O}(N/\epsilon^{1.5})$ provided by Allen-Zhu and Orecchia for the covering problem using a different,
more complicated algorithm [1], and our result corresponds to the linear rate of convergence that accelerated
gradient methods are designed to achieve [8].

At least as interesting as the $\tilde{O}(1/\epsilon^{0.5})$ improvement for covering LPs, however, is the context of this problem
and the main technical contribution that we developed and exploited to achieve our improvement.

• The context for our results has to do with the linear coupling method that was introduced recently by
Allen-Zhu and Orecchia [14]. This is a method for solving convex optimization problems with first order
methods, and it provides a conceptually simple way to integrate a gradient descent step and mirror descent
step in each iteration. In the setting of standard smooth convex optimization, the method achieves the
same convergence rate as that of the accelerated gradient descent method of Nesterov [8], and indeed the
former can be viewed as an insightful reinterpretation of the latter. The high-level approach of the linear
coupling method is very flexible, and it has shown initial promise by providing improved algorithms for
packing and covering LPs [1,2].

The particular motivation for our work is a striking discrepancy between bounds provided for packing and
covering LPs in the recent result of Allen-Zhu and Orecchia in [1]. In particular, they provide a
$(1-\epsilon)$-approximation solver for the packing problem in $\tilde{O}(N/\epsilon)$, but they are only able to obtain $\tilde{O}(N/\epsilon^{1.5})$ for
the covering problem, and for that they need to use a different, more complicated algorithm. This
discrepancy between results for packing and covering LPs is rare, due to the duality between them, and it leaves
open the question of the exact role that the linear coupling is playing in coordinating the complementary
gradient and mirror descent step of the algorithms for these dual problems.

• Our main technical contribution is a novel diameter reduction method for fractional covering LPs that
helps resolve this discrepancy. Recall that the smoothness parameter, e.g., Lipschitz constant, and the
diameter of the feasible region are the two most natural limiting factors for most gradient based optimization
algorithms. Indeed, many applications of general first-order optimization techniques can be attributed to
the existence of norms or proximal setups for the specific problems that give both good smoothness and
diameter properties. In the particular case of coordinate descent algorithms based on the linear coupling
idea, we additionally need good coordinate-wise diameter properties to achieve accelerated convergence.

This is easy to accomplish for packing problems, but it is not easy to do for covering problems, and
it is this difference that leads to the $\tilde{O}(1/\epsilon^{0.5})$ discrepancy between packing and covering algorithms in
previous work [1]. Our diameter reduction method for general covering problems is straightforward, and it
gives both good diameter bounds with respect to the canonical norm for accelerated stochastic coordinate
descent (as is needed generally [1,9]) as well as good coordinate-wise diameter bounds (as is needed for
linear coupling [1]). Thus, it is likely of interest more generally for combinatorial optimization problems.

Footnote 1: More precisely, our method has a logarithmic dependence on the width, but by Observation 4.2 below, this cannot be worse than
$\log(nm/\epsilon)$, and thus we consider it as width-independent.

Once the diameter reduction is achieved, the remaining work is mainly straightforward, as we can directly apply
known optimization schemes that work well for problems with good diameter properties. In particular, by using
the scheme from [1] that was developed for packing LPs, we obtain improved $\tilde{O}(N/\epsilon)$ results for covering LPs;
and this provides a unified acceleration method (unified in the sense that it is with the same algorithm and almost
identical analysis) for both packing and covering LPs.

We will start in Section 2 with a description of some of the challenges in applying acceleration techniques 
in a unified way to these two dual problems, including those that limited previous work. Then, in Section 3 
we will present our main technical contribution, a novel diameter reduction method for any covering LP of the 
form given in (1). Finally, in Section 4 we describe how to combine this with previous work to obtain a unified 
acceleration method for packing and covering problems. We include a full description of the latter analysis, with 
some of the details deferred to Appendix A. 

2 High-level Description of Challenges 

At a high level, we (as well as Allen-Zhu and Orecchia [1,2]) use the same two-step approach of Nesterov [8].
The first step involves smoothing, which transforms the constrained problem into a smooth objective function
with trivial or no constraints. By smooth, we mean that the gradient of the objective function has some property in
the flavor of Lipschitz continuity. Once smoothing is accomplished, the second step uses one of several first order
methods for convex optimization in order to obtain an approximate solution. Examples of standard applications
of this approach to covering LPs include the width-dependent solvers of [7,8] as well as multiplicative weights
update solvers [3].

The first width-independent result following the optimization approach in [2] achieves width-independence
by truncating the gradient, thus effectively reducing the width to 1. The algorithm uses, in a white-box way, the
coupling of mirror descent and gradient descent from [14], which can be viewed as a re-interpretation of
Nesterov’s accelerated gradient method [8]. However, although [2] uses a coupling of mirror descent and gradient
descent, the role of gradient descent is only for width-independence, i.e., to cover the loss incurred by the large
component of the gradient (see Eqn. (7) below for the precise formulation of this loss), and it is independent of
the mirror descent part acting on the truncated gradient. In addition, [2] deviates from the canonical smoothing
with entropy, as it instead uses generalized entropy. Importantly, the objective function to be minimized is not
smooth in the standard Lipschitz continuity sense, but it does satisfy a similar local Lipschitz property.

To improve the sequential packing solver in [2] with convergence $\tilde{O}(1/\epsilon^3)$ to $\tilde{O}(1/\epsilon)$, the same authors
in [1] apply a stochastic coordinate descent method based on the linear coupling idea. Barring the difference
between Lipschitz and local Lipschitz continuity, the results in [1] can be viewed as a variant of the accelerated
coordinate descent method [9]. There are two places where the algorithm achieves an improvement over prior
packing-covering results.

• One factor of improvement is due to the better coordinate-wise Lipschitz constant over the full dimensional
Lipschitz constant. Intuitively, in the case of packing or covering, the gradient of variable $x_i$ depends on
the penalties of constraints involving $x_i$, which further depend on all the variables in those constraints. As
a result, if we move all the variables simultaneously, we can only take a small step before changing the
gradient of $x_i$ drastically.

• The other factor of improvement comes from accelerating the gradient method. The role of gradient
descent in the packing solver of [1] is twofold. First, it covers the loss incurred by the large component
of the gradient as in [2] to give width-independence. Second, to accelerate the coupling as in [14], the
gradient descent also needs to cover the regret term incurred by the mirror descent step (see Eqn. (7) below
for the precise formulation of this regret). The adoption of the $A$-norm (defined in Eqn. (6) below) enables the
acceleration. This $A$-norm works particularly well for packing problems, in the sense that it easily leads
to good diameter bounds: since the packing constraints impose a naive upper bound of $x_i \le 1/\|A_{:i}\|_\infty$
on each variable, the feasible region has a small diameter, as measured by $\|x - x^*\|_A$.

The importance of the small diameter is twofold. First, the diameter naturally arises in the convergence
bound of gradient based methods, so we always need to use a norm or proximal setup giving small diameter
to achieve good convergence. Second, and more importantly, in this case the small diameter $[0, 1/\|A_{:i}\|_\infty]$
on each coordinate relates the mirror descent step length and the gradient descent step length. As the
regret term in mirror descent and the improvement of the gradient descent step are both proportional to their
respective step lengths, the small coordinate-wise diameter makes it possible to use the gradient descent
improvement to cover the mirror descent regret.

The combination of gradient truncation, stochastic coordinate descent, and acceleration due to small diameter in
the $A$-norm leads to the $\tilde{O}(N/\epsilon)$ solver for the packing LP [1].

Shifting to solvers for the covering LP, one obvious obstacle to reproducing the packing result is that we no longer
have the small diameter in the $A$-norm. Indeed, a naive coordinate-wise upper bound from the covering constraints
only gives $x_i \le 1/\min_j\{A_{ji} : A_{ji} > 0\}$. Because of this, the covering solver in [1] instead uses the proximal
setup in their earlier work [2]. The particular proximal setup gives a good diameter for the feasible region they
use, but it doesn’t give a similarly good coordinate-wise diameter to enable the acceleration. To improve upon the
$\tilde{O}(1/\epsilon^2)$ convergence of standard mirror descent, the authors use a negative-width technique as in [3] (Theorem
3.3 with $\ell = \sqrt{\epsilon}$). This then leads to the (improved, but still worse than for packing) $\tilde{O}(1/\epsilon^{1.5})$ convergence
rate. In addition, since they truncate the gradient at a smaller threshold to cover the loss incurred by the large
component, they need a more complicated gradient step, leading to a more complicated algorithm than for the
packing LP.

To get an $\tilde{O}(1/\epsilon)$ solver for the covering LP, it seems crucial to relate the gradient descent step and mirror
descent step the same way as in the packing solver in [1]. Thus, we will stick with the $A$-norm, and we will work
directly to reduce the diameter. Our main result (presented next in Section 3) is a general diameter reduction
method to achieve the same diameter property as in the packing solver, and this enables us (in Section 4) to
extend all the crucial ideas of the packing solver in [1], as outlined in this section, to get a covering solver with
running time $\tilde{O}(N/\epsilon)$.

3 Diameter Reduction Method for General Covering Problems 

Given any covering LP of the form given in (1), characterized by a matrix $A$, we formulate an equivalent covering
LP with good diameter properties. This will involve adding variables and redundant constraints. We use $i \in [n]$
to denote the indices of the variables (i.e., columns of $A$) and $j \in [m]$ to denote the indices of constraints (i.e.,
rows of $A$). For ease of comparison with [1], and since our unified approach for both packing and covering uses
their packing solver and a similar analysis, we use the same notation whenever possible.

For any $i \in [n]$, let

$$r_i \stackrel{\mathrm{def}}{=} \frac{\max_j\{A_{ji} : A_{ji} > 0\}}{\min_j\{A_{ji} : A_{ji} > 0\}}$$

be the ratio between the largest non-zero coefficient and the smallest non-zero coefficient of variable $x_i$ in all
constraints, and let $n_i = \lceil \log r_i \rceil$. We first duplicate each original variable $n_i$ times to obtain $\bar{x}_{(i,l)}$, $i \in [n]$, $l \in [n_i]$,
as the new variables. In terms of the coefficient matrix, we now have a new matrix, call it $\bar{A} \in \mathbb{R}^{m \times (\sum_i n_i)}_{\ge 0}$,
which contains $n_i$ copies of the $i$-th column $A_{:i}$. We denote a column of $\bar{A}$ by the tuple $(i, l)$ with $l \in [n_i]$.
Obviously, the covering LP given by $\bar{A}$ is equivalent to the original covering LP given by $A$. Adding additional
copies of variables, however, will allow us to improve the diameter. To reduce the diameter of this new covering
LP, we further decrease some of the coefficients in $\bar{A}$, and we put upper bounds on the variables. In particular,
for $j, i, l$, we have

$$\bar{A}_{j,(i,l)} = \min\{A_{ji},\ 2^l \min_j\{A_{ji} : A_{ji} > 0\}\}, \tag{3}$$

and for variable $\bar{x}_{(i,l)}$, we add the constraint

$$\bar{x}_{(i,l)} \le \frac{2}{2^l \min_j\{A_{ji} : A_{ji} > 0\}}. \tag{4}$$
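The construction in (3) and (4) can be sketched in a few lines. This is an illustrative, explicit version under stated assumptions: the matrix is a dense list of rows, logarithms are base 2, every column has a positive entry, and a column whose non-zero entries are all equal is given one copy (the `max(1, ...)` guard, a convention chosen here for the sketch). The name `diameter_reduce` is hypothetical.

```python
import math


def diameter_reduce(A):
    """Build the reduced covering instance described by (3) and (4).

    Returns (A_bar, upper, cols): the reduced matrix as a list of rows,
    the upper bound of each new variable, and its (i, l) label.
    """
    m, n = len(A), len(A[0])
    new_columns, upper, cols = [], [], []
    for i in range(n):
        positive = [A[j][i] for j in range(m) if A[j][i] > 0]
        lo, hi = min(positive), max(positive)
        # n_i = ceil(log r_i); at least one copy is kept (assumption of this sketch).
        n_i = max(1, math.ceil(math.log2(hi / lo)))
        for l in range(1, n_i + 1):
            cap = (2 ** l) * lo
            new_columns.append([min(A[j][i], cap) for j in range(m)])  # Eq. (3)
            upper.append(2.0 / cap)                                    # Eq. (4)
            cols.append((i, l))
    A_bar = [[col[j] for col in new_columns] for j in range(m)]
    return A_bar, upper, cols


# Toy instance: the first column has ratio r = 8, the second has ratio r = 1.
A_bar, upper, cols = diameter_reduce([[1.0, 1.0], [8.0, 1.0]])
```

On this toy instance the first column is split into three scaled copies and the second is kept as a single copy, so the reduced matrix has four columns.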

The next lemma shows that the covering LP given by $A$ and the covering LP given by $\bar{A}$ are equivalent.

Lemma 3.1. Let OPT be the optimal value of the covering LP given by $A$, and let $\overline{\mathrm{OPT}}$ be the optimal value of the
covering LP given by $\bar{A}$ and (4), as constructed above; then $\overline{\mathrm{OPT}} = \mathrm{OPT}$.

Proof. Given any feasible solution $\bar{x}$, consider the solution $x$ where $x_i = \sum_{l=1}^{n_i} \bar{x}_{(i,l)}$. It is obvious that $\mathbf{1}^T \bar{x} = \mathbf{1}^T x$,
and $Ax \ge \mathbf{1}$, as coefficients in $\bar{A}$ are no larger than coefficients in $A$. Thus $\mathrm{OPT} \le \overline{\mathrm{OPT}}$.

For the other direction, consider any feasible $x$. For each $i$, we can assume without loss of generality that

$$x_i \le \frac{1}{\min_j\{A_{ji} : A_{ji} > 0\}}.$$

Let $l_i$ be the largest index such that

$$x_i \le \frac{2}{2^{l_i} \min_j\{A_{ji} : A_{ji} > 0\}},$$

and then let

$$\bar{x}_{(i,l)} = \begin{cases} x_i & \text{if } l = l_i \\ 0 & \text{if } l \ne l_i. \end{cases}$$

By construction, $\bar{x}$ satisfies all the upper bounds described in (4). Furthermore, for constraint $j$, we must
have $\bar{A}_{j:}\bar{x} \ge 1$. To see this, note that for any $i$, $\bar{A}_{j,(i,l_i)}$ differs from $A_{ji}$ only when $A_{ji} > 2^{l_i} \min_j\{A_{ji} : A_{ji} > 0\}$, and we
must have $l_i < n_i$ in this case by definition of $n_i$, which gives
$\bar{x}_{(i,l_i)} = x_i > \frac{1}{2^{l_i} \min_j\{A_{ji} : A_{ji} > 0\}}$ by our choice
of $l_i$ being the largest possible. Then we know $\bar{A}_{j,(i,l_i)} = 2^{l_i} \min_j\{A_{ji} : A_{ji} > 0\}$, so the $j$-th constraint is
satisfied. Thus $\mathrm{OPT} \ge \overline{\mathrm{OPT}}$, and we can conclude $\overline{\mathrm{OPT}} = \mathrm{OPT}$. □
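The second direction of the proof is constructive, and the map from a feasible $x$ to the lifted solution can be sketched as follows. This is an illustration only: it assumes base-2 scales, at least one copy per column, and that the input already satisfies $x_i \le 1/\min_j\{A_{ji} : A_{ji} > 0\}$; the helper names are hypothetical.

```python
import math


def column_scales(A, i):
    """Return (smallest positive entry of column i, number of copies n_i)."""
    positive = [row[i] for row in A if row[i] > 0]
    lo, hi = min(positive), max(positive)
    return lo, max(1, math.ceil(math.log2(hi / lo)))


def lift_solution(A, x):
    """Place each x_i on the copy l_i, the largest l with x_i <= 2 / (2^l * lo)."""
    lifted = []
    for i, x_i in enumerate(x):
        lo, n_i = column_scales(A, i)
        # Precondition x_i <= 1/lo guarantees that l = 1 always qualifies.
        l_i = max(l for l in range(1, n_i + 1) if x_i <= 2.0 / (2 ** l * lo))
        lifted.extend(((i, l), x_i if l == l_i else 0.0) for l in range(1, n_i + 1))
    return lifted


def reduced_row_value(A, j, lifted):
    """Left-hand side of constraint j in the reduced LP, using the capped coefficients."""
    total = 0.0
    for (i, l), value in lifted:
        lo, _ = column_scales(A, i)
        total += min(A[j][i], 2 ** l * lo) * value
    return total


A = [[1.0, 1.0], [8.0, 1.0]]
lifted = lift_solution(A, [0.5, 0.5])
```

Here the weight $0.5$ on the first variable lands on the copy with $l = 2$, whose coefficient in the second row is capped at 4 rather than 8, and both constraints remain covered.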


Given that we have shown that the covering LP defined by $A$ and that defined by $\bar{A}$ are equivalent, we now
point out that the seemingly-redundant constraints of (4) turn out to be crucial. The reason is that the feasible
region now has a small diameter in the coordinate-wise weighted 2-norm $\|\cdot\|_{\bar{A}}$. In particular, we can rewrite the
constraints (4) to be

$$\bar{x}_{(i,l)} \le \frac{2}{\|\bar{A}_{:(i,l)}\|_\infty}.$$

For any $i$, this is the same upper bound on $\bar{x}_{(i,l)}$ for $l < n_i$ (consider the row $j^* = \arg\max_j\{A_{ji} : A_{ji} > 0\}$),
and it is a relaxation on $\bar{x}_{(i,n_i)}$.

The price we pay for this diameter improvement is that the new LP defined by $\bar{A}$ is larger than that defined
by $A$. Two comments on this are in order. First, by Observation 4.2 below, $r_i$ is bounded by $n^2 m/\epsilon^2$, and
so the diameter reduction step only increases the problem size by $O(\log(mn/\epsilon))$. Second, we have presented
our diameter reduction as an explicit pre-processing step so we can use one unified optimization algorithm
(Algorithm 1 below) for both packing and covering, but in practice the diameter reduction would not have to be
carried out explicitly. It can equivalently be implemented implicitly within the algorithm (a trivially-modified
version of Algorithm 1 below) by randomly choosing a scale after picking the coordinate $i$ and then computing
$\bar{A}_{j,(i,l)}$ in (3) by shifting bits on the fly.

Given this reduction, in the rest of the paper, when we refer to the covering LP, we will implicitly be referring
to the diameter reduced version, and we have the additional guarantee that there exists an optimal solution $x^*$ to
(1) such that

$$0 \le x^*_i \le \frac{2}{\|A_{:i}\|_\infty} \quad \forall i \in [n]. \tag{5}$$


4 An Accelerated Solver for (Packing and) Covering LPs 

In this section, we will present our solver for covering LPs of the form (1). To motivate this, recall that for
packing problems of the form (2), bounds of the form (5) automatically follow from the packing constraints
$Ax \le \mathbf{1}$. For readers familiar with the packing LP solver in [1], it should be plausible that, once we have this
diameter property, the same stochastic coordinate descent optimization scheme will lead to an $\tilde{O}(N/\epsilon)$ covering
LP solver. We now show that indeed the same optimization algorithm for packing LPs can be easily extended to
solving covering LPs, thus establishing a unified acceleration method for packing and covering problems.

In Section 4.1, we’ll present some preliminaries and describe how we perform smoothing on the original
covering objective function; and then in Section 4.2, we’ll present our main algorithm. This algorithm involves
a mirror descent step, that will be described in Section 4.3, a gradient descent step, that will be described in
Section 4.4, and a careful coupling between the two, that will be described in Section 4.5. Finally, in Section 4.6,
we will describe how to ensure we start at a good starting point. Some of the following results are
technically-tedious but conceptually-straightforward extensions of analogous results from [1], and some of the results are
restated from [1]; for completeness, we provide the proof of all of these results, with the latter relegated to
Appendix A.

4.1 Preliminaries and Smoothing the Objective 

To start, let’s assume that

$$\min_{j \in [m]} \|A_{j:}\|_\infty = 1.$$

This assumption is without loss of generality: since we are interested in multiplicative $(1+\epsilon)$-approximation,
we can simply scale $A$ for this to hold without sacrificing approximation quality. With this assumption, the
following lemma holds. (This lemma is the same as Proposition C.2.(a) in [1], and its proof is included for
completeness in Appendix A.)

Lemma 4.1. $\mathrm{OPT} \in [1, m]$.

With OPT being at least 1, the error we introduce later in the smoothing step will be small enough that the 
smoothing function approximates the covering LP well enough with respect to e around the optimum. 

Observation 4.2. Since we are interested in a $(1+\epsilon)$-approximation, then with the above assumption, we can
also eliminate the very small and very large entries from the matrix as follows. If some entry $A_{ji} < \epsilon/(mn)$,
then since $\mathrm{OPT} \le m$ we have that $A_{ji} x^*_i \le \epsilon/n$, and so we can just increase each variable by $\epsilon/n$, in which
case we can recover the loss from setting $A_{ji}$ equal to 0 from the variable in the $j$-th constraint with coefficient
at least 1. On the other hand, if some entry $A_{ji} > n/\epsilon$, then we can just set variable $i$ to be at least $\epsilon/n$ and
ignore constraint $j$. Thus, we can eliminate very small and very large entries from the matrix $A$, and we only
incur an additional cost of $\epsilon$, but since $\mathrm{OPT} \ge 1$, we still obtain a $(1+O(\epsilon))$-approximation.

We will turn the covering LP objective into a smoothed objective function $f_\mu(x)$, as used in [1,2], and we are
going to find a $(1+\epsilon)$-approximation of the covering LP by approximately minimizing $f_\mu(x)$ over the region

$$\Delta = \{x \in \mathbb{R}^n : 0 \le x_i \le \ldots\}.$$

The function $f_\mu(x)$ is

$$f_\mu(x) = \mathbf{1}^T x + \max_{y \ge 0}\,\{y^T(\mathbf{1} - Ax) + \mu H(y)\},$$

and it is a smoothed objective in the sense that it turns the covering constraints into soft penalties, with $H(y)$
being a regularization term. Here, we use the generalized entropy $H(y) = -\sum_j y_j \log y_j + y_j$, where $\mu$ is the
smoothing parameter balancing the penalty and the regularization. It is straightforward to compute the optimal
$y$, and write $f_\mu(x)$ explicitly, as stated in the following lemma.

Lemma 4.3. $f_\mu(x) = \mathbf{1}^T x + \mu \sum_{j=1}^m p_j(x)$, where $p_j(x) = \exp\!\big(\frac{1}{\mu}(1 - (Ax)_j)\big)$.
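As a worked check of Lemma 4.3 (a sketch that uses only the definitions of $f_\mu$ and $H$ given above), the inner maximization separates over the coordinates of $y$, and each one-dimensional problem is concave and solved by setting its derivative to zero:

```latex
\begin{align*}
\max_{y \ge 0}\Big\{ y^T(\mathbf{1} - Ax) + \mu H(y) \Big\}
  &= \sum_{j} \max_{y_j \ge 0}
     \Big\{ y_j\,\big(1 - (Ax)_j\big) - \mu\, y_j \log y_j + \mu\, y_j \Big\}, \\
\frac{\partial}{\partial y_j}:\quad
  \big(1 - (Ax)_j\big) - \mu \log y_j &= 0
  \;\Longrightarrow\;
  y_j = \exp\!\Big(\tfrac{1}{\mu}\big(1 - (Ax)_j\big)\Big) = p_j(x), \\
\text{value at } y_j = p_j(x):\quad
  y_j\,\big(1 - (Ax)_j\big) - y_j\,\big(1 - (Ax)_j\big) + \mu\, y_j
  &= \mu\, p_j(x).
\end{align*}
```

Summing over $j$ and adding $\mathbf{1}^T x$ recovers the closed form stated in the lemma.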

Optimizing $f_\mu(x)$ over $\Delta$ gives a good approximation to OPT, in the following sense. If we let $x^*$ be an
optimal solution satisfying (5), and $u^* = (1+\epsilon/2)x^* \in \Delta$, then we have the properties in the following lemma.
(This lemma is the same as Proposition C.2 in [1], and its proof is included for completeness in Appendix A.)

Lemma 4.4. Setting the smoothing parameter $\mu = \frac{\epsilon}{4\log(nm/\epsilon)}$, we have

1. $f_\mu(u^*) \le (1+\epsilon)\,\mathrm{OPT}$.

2. $f_\mu(x) \ge (1-\epsilon)\,\mathrm{OPT}$ for any $x \ge 0$.

3. For any $x \ge 0$ satisfying $f_\mu(x) \le 2\,\mathrm{OPT}$, we must have $Ax \ge (1-\epsilon)\mathbf{1}$.

4. If $x \ge 0$ satisfies $f_\mu(x) \le (1+O(\epsilon))\,\mathrm{OPT}$, then $\frac{1}{1-\epsilon}x$ is a $(1+O(\epsilon))$-approximation to the covering LP.

5. The gradient of $f_\mu(x)$ is

$$\nabla f_\mu(x) = \mathbf{1} - A^T p(x), \quad \text{where } p_j(x) = \exp\!\big(\tfrac{1}{\mu}(1 - (Ax)_j)\big),$$

and $\nabla_i f_\mu(x) = 1 - \sum_j A_{ji}\, p_j(x) \in [-\infty, 1]$.
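To make the closed forms in Lemma 4.3 and part 5 of Lemma 4.4 concrete, here is a minimal sketch that evaluates $f_\mu$ and its gradient for a dense list-of-rows matrix. The function names are illustrative, and no care is taken about overflow of the exponential when $\mu$ is very small.

```python
import math


def penalties(A, x, mu):
    """p_j(x) = exp((1 - (Ax)_j) / mu) for every constraint j."""
    return [math.exp((1.0 - sum(a * v for a, v in zip(row, x))) / mu) for row in A]


def smoothed_objective(A, x, mu):
    """f_mu(x) = 1^T x + mu * sum_j p_j(x)."""
    return sum(x) + mu * sum(penalties(A, x, mu))


def gradient(A, x, mu):
    """grad_i f_mu(x) = 1 - sum_j A_ji p_j(x); every coordinate is at most 1."""
    p = penalties(A, x, mu)
    return [1.0 - sum(A[j][i] * p[j] for j in range(len(A))) for i in range(len(x))]


# Small instance chosen for illustration only.
A = [[1.0, 2.0], [3.0, 1.0]]
x = [0.3, 0.4]
grad = gradient(A, x, mu=0.5)
```

A central finite difference of `smoothed_objective` along each coordinate should agree with `gradient` up to discretization error, which is a convenient sanity check when experimenting with other smoothing parameters.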

Although $f_\mu(x)$ gives a good approximation to the covering LP, we cannot simply apply the standard
(accelerated) gradient descent algorithm to optimize it, as $f_\mu(x)$ doesn’t have the necessary Lipschitz-smoothness
property. However, $f_\mu(x)$ is locally Lipschitz continuous, in a sense quantified by the following lemma, and so
we have a good improvement with a gradient step within a certain range. (The following is a “symmetric” version (see footnote 2)
of Lemma 2.6 in [1].)

Lemma 4.5. Let $L = \frac{4}{\mu}$. For any $x \in \Delta$ and $i \in [n]$:

1. If $\nabla_i f_\mu(x) \in (-1, 1)$, then for all $|\gamma| \le \frac{1}{L\|A_{:i}\|_\infty}$, we have

$$|\nabla_i f_\mu(x) - \nabla_i f_\mu(x + \gamma e_i)| \le L\|A_{:i}\|_\infty |\gamma|.$$

2. If $\nabla_i f_\mu(x) \le -1$, then for all $\gamma \le \frac{1}{L\|A_{:i}\|_\infty}$, we have

$$\nabla_i f_\mu(x + \gamma e_i) \le \Big(1 - \frac{L\|A_{:i}\|_\infty}{2}|\gamma|\Big)\nabla_i f_\mu(x).$$


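Proof. First, observe the following:

$$\left|\log\frac{1 - \nabla_i f_\mu(x + \gamma e_i)}{1 - \nabla_i f_\mu(x)}\right|
\le \left|\int_0^\gamma \frac{\nabla_{ii} f_\mu(x + \nu e_i)}{1 - \nabla_i f_\mu(x + \nu e_i)}\,d\nu\right|
= \frac{1}{\mu}\left|\int_0^\gamma \frac{\sum_j A_{ji}^2\, p_j(x + \nu e_i)}{\sum_j A_{ji}\, p_j(x + \nu e_i)}\,d\nu\right|
\le \frac{1}{\mu}\left|\int_0^\gamma \|A_{:i}\|_\infty\,d\nu\right|
= \frac{1}{\mu}\|A_{:i}\|_\infty|\gamma| = \frac{L\|A_{:i}\|_\infty}{4}|\gamma|.$$

Then, we have

$$\exp\!\Big(-\frac{L\|A_{:i}\|_\infty}{4}|\gamma|\Big) \le \frac{1 - \nabla_i f_\mu(x + \gamma e_i)}{1 - \nabla_i f_\mu(x)} \le \exp\!\Big(\frac{L\|A_{:i}\|_\infty}{4}|\gamma|\Big).$$

Since $\frac{L\|A_{:i}\|_\infty}{4}|\gamma| \le \frac{1}{4}$ by our assumption, and since $x \le e^x - 1$ for all $x$ and $e^x - 1 \le 1.2x$ for $x \in [0, \frac{1}{4}]$, it follows that

$$-\frac{L\|A_{:i}\|_\infty}{4}|\gamma| \le \frac{\nabla_i f_\mu(x) - \nabla_i f_\mu(x + \gamma e_i)}{1 - \nabla_i f_\mu(x)} \le 1.2\,\frac{L\|A_{:i}\|_\infty}{4}|\gamma|.$$

Finally, to prove the lemma we consider the following two cases:

1. If $\nabla_i f_\mu(x) \in (-1, 1)$, then we have

$$|\nabla_i f_\mu(x) - \nabla_i f_\mu(x + \gamma e_i)| \le 1.2\,(1 - \nabla_i f_\mu(x))\,\frac{L\|A_{:i}\|_\infty}{4}|\gamma| \le L\|A_{:i}\|_\infty|\gamma|.$$

2. If $\nabla_i f_\mu(x) \le -1$, then $1 - \nabla_i f_\mu(x) \le -2\nabla_i f_\mu(x)$, and

$$\nabla_i f_\mu(x + \gamma e_i) \le \nabla_i f_\mu(x) + (1 - \nabla_i f_\mu(x))\,\frac{L\|A_{:i}\|_\infty}{4}|\gamma| \le \Big(1 - \frac{L\|A_{:i}\|_\infty}{2}|\gamma|\Big)\nabla_i f_\mu(x).$$

□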

We call $L\|A_{:i}\|_\infty$ the coordinate-wise local Lipschitz constant. For readers familiar with the accelerated
coordinate descent method (ACDM) [9], the $A$-norm is essentially the $\|\cdot\|_{1-\alpha}$ in ACDM [9] with $\alpha = 0$, except
we use the coordinate-wise local Lipschitz constant instead of the Lipschitz constant to weight each coordinate.
The significance of Lemma 4.5 is that for covering LPs the coordinate-wise diameter is inversely proportional to
the coordinate-wise local Lipschitz constant. (This fact has been established previously for the case of packing
LPs [1].)

Algorithm 1 Accelerated stochastic coordinate descent for both packing and covering

Input: $A \in \mathbb{R}^{m \times n}_{\ge 0}$, $x^{\mathrm{start}} \in \Delta$, $f_\mu$, $\epsilon$. Output: $y_T \in \Delta$.

 1: $\mu \leftarrow \frac{\epsilon}{4\log(nm/\epsilon)}$, $L \leftarrow \frac{4}{\mu}$, $\tau \leftarrow \frac{1}{8nL}$
 2: $T \leftarrow \lceil 8nL\log(1/\epsilon) \rceil = \tilde{O}(n/\epsilon)$
 3: $x_0, y_0, z_0 \leftarrow x^{\mathrm{start}}$, $\alpha_0 \leftarrow \frac{1}{nL}$
 4: for $k = 1$ to $T$ do
 5:     $\alpha_k \leftarrow \frac{1}{1-\tau}\alpha_{k-1}$
 6:     $x_k \leftarrow \tau z_{k-1} + (1-\tau) y_{k-1}$
 7:     Select $i \in [n]$ uniformly at random.
        ▷ Gradient truncation:
 8:     Let $\xi_k^{(i)} \leftarrow \begin{cases} -1 & \nabla_i f_\mu(x_k) < -1 \\ \nabla_i f_\mu(x_k) & \nabla_i f_\mu(x_k) \in [-1, 1] \\ 1 & \nabla_i f_\mu(x_k) > 1 \end{cases}$
        ▷ Mirror descent step:
 9:     $z_k \leftarrow z_k^{(i)} \stackrel{\mathrm{def}}{=} \arg\min_{z \in \Delta}\,\{V_{z_{k-1}}(z) + \langle z, n\alpha_k \xi_k^{(i)} \rangle\}$
        ▷ Gradient descent step:
10:     $y_k \leftarrow y_k^{(i)} \stackrel{\mathrm{def}}{=} x_k + \frac{1}{n\alpha_k L}\big(z_k^{(i)} - z_{k-1}\big)$
11: end for
12: return $y_T$.

4.2 An Accelerated Coordinate Descent Algorithm 

We will now show that the accelerated coordinate descent used in the packing LP solver in [1] also works as a
covering LP solver, with appropriately-chosen starting points and smoothed objective functions. Consider
Algorithm 1, which is our main accelerated stochastic coordinate descent for both packing and covering. This
algorithm takes as input a matrix $A \in \mathbb{R}^{m \times n}_{\ge 0}$, an initial condition $x^{\mathrm{start}} \in \Delta$, a smoothed function $f_\mu$, and an
error parameter $\epsilon$, and it returns as output a vector $y_T \in \Delta$. The correctness of this algorithm and its running
time guarantees for the packing problem have already been nicely presented in [1], and so here we will focus on
the covering problem.
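The loop structure of Algorithm 1 for the covering objective can be sketched as below. This is a naive illustration rather than the paper's implementation: each iteration recomputes $Ax$ from scratch instead of running in expected $O(N/n)$ time, the box bounds `upper` that define $\Delta$ are supplied by the caller, the initial value $\alpha_0 = 1/(nL)$ and the update $\alpha_k = \alpha_{k-1}/(1-\tau)$ are assumed to follow the packing solver of [1], and the exponent is clamped purely as a numerical guard. The name `solve_covering` is hypothetical.

```python
import math
import random


def dot(row, v):
    return sum(a * b for a, b in zip(row, v))


def solve_covering(A, x_start, upper, eps, seed=0):
    """Naive sketch of Algorithm 1 on the box 0 <= x_i <= upper[i]."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    col_norm = [max(A[j][i] for j in range(m)) for i in range(n)]  # ||A_{:i}||_inf
    mu = eps / (4.0 * math.log(n * m / eps))
    L = 4.0 / mu
    tau = 1.0 / (8.0 * n * L)
    T = math.ceil(8.0 * n * L * math.log(1.0 / eps))
    alpha = 1.0 / (n * L)  # assumed initial value, following [1]
    y, z = list(x_start), list(x_start)
    for _ in range(T):
        alpha /= (1.0 - tau)  # assumed update, following [1]
        x = [tau * zi + (1.0 - tau) * yi for zi, yi in zip(z, y)]
        i = rng.randrange(n)
        # Coordinate gradient 1 - sum_j A_ji p_j(x); exponent clamped to avoid overflow.
        g = 1.0 - sum(
            A[j][i] * math.exp(min(700.0, (1.0 - dot(A[j], x)) / mu)) for j in range(m)
        )
        xi = max(-1.0, min(1.0, g))  # gradient truncation
        # Mirror step in the A-norm: a clipped move along coordinate i only.
        z_new_i = min(upper[i], max(0.0, z[i] - n * alpha * xi / col_norm[i]))
        # Gradient step: y_k = x_k + (z_k - z_{k-1}) / (n * alpha_k * L).
        y = list(x)
        y[i] = x[i] + (z_new_i - z[i]) / (n * alpha * L)
        z[i] = z_new_i
    return y


# Toy run on the 2x2 identity with an illustrative box and starting point.
y_T = solve_covering([[1.0, 0.0], [0.0, 1.0]], [1.5, 1.5], [3.0, 3.0], eps=0.1)
```

The sketch is meant only to show how the coupling, truncation, mirror step, and gradient step fit together; the guarantees of Theorem 4.6 rely on the starting point and parameter choices analyzed in the paper.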

Our main result is summarized in the following theorems. 

Theorem 4.6. With $x^{\mathrm{start}}$ computable in time $\tilde{O}(N)$ to be specified later, Algorithm 1 outputs $y_T$ satisfying
$E[f_\mu(y_T)] \le (1 + 6\epsilon)\,\mathrm{OPT}$, and the running time is $\tilde{O}(N/\epsilon)$.

Given Theorem 4.6, a standard application of the Markov bound, together with part 4 of Lemma 4.4, gives the
following theorem as a corollary.

Theorem 4.7. There is an algorithm that, with probability at least 9/10, computes a $(1+O(\epsilon))$-approximation
to the fractional covering problem and has $\tilde{O}(N/\epsilon)$ expected running time.

Not surprisingly, due to the structural similarities of packing and covering problems after diameter reduction, 
the correctness of Algorithm 1 for covering can be established using the same approach as [1] did for packing. 
The modifications are fairly straightforward, and we will point out the similarities whenever possible. 

Before proceeding with our proof of these theorems, we discuss briefly the optimization scheme from [1] we
will use. First, observe that the $A$-norm, where

$$\|x\|_A \stackrel{\mathrm{def}}{=} \sqrt{\sum_{i} \|A_{:i}\|_\infty\, x_i^2}, \tag{6}$$

is used as the proximal setup for mirror descent. The corresponding distance generating function is $w(x) =
\frac{1}{2}\|x\|_A^2$, and the Bregman divergence is $V_x(y) = \frac{1}{2}\|x - y\|_A^2$ (see footnote 3).

Footnote 2: The smoothed objective function for the packing LP is $-\mathbf{1}^T y + \mu \sum_j q_j(y)$, where $q_j(y) = \exp\!\big(\frac{1}{\mu}((Ay)_j - 1)\big)$, which is symmetric
to $f_\mu$. The properties of $f_\mu(x)$ mirror those of its packing counterpart, and they can be derived in the same way as [1] did for
the packing function, but we include the proof to highlight differences.

Footnote 3: In particular, $w$ is a 1-strongly convex function with respect to $\|\cdot\|_A$, and $V_x(y) = w(y) - \langle \nabla w(x), y - x \rangle - w(x)$. See [14] for a
detailed discussion of mirror descent as well as several interpretations.
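The identity between the two expressions for the Bregman divergence can be checked numerically. The sketch below takes the column weights $\|A_{:i}\|_\infty$ as a plain list `col_norm`; the names are illustrative.

```python
def w(col_norm, x):
    """Distance generating function w(x) = 0.5 * ||x||_A^2."""
    return 0.5 * sum(c * v * v for c, v in zip(col_norm, x))


def bregman(col_norm, x, y):
    """V_x(y) = w(y) - <grad w(x), y - x> - w(x), with grad w(x)_i = c_i * x_i."""
    inner = sum(c * a * (b - a) for c, a, b in zip(col_norm, x, y))
    return w(col_norm, y) - inner - w(col_norm, x)


def half_sq_dist(col_norm, x, y):
    """0.5 * ||x - y||_A^2, the closed form of the same divergence."""
    return 0.5 * sum(c * (a - b) ** 2 for c, a, b in zip(col_norm, x, y))


col_norm = [2.0, 3.0]
gap = bregman(col_norm, [1.0, 2.0], [0.5, 4.0]) - half_sq_dist(col_norm, [1.0, 2.0], [0.5, 4.0])
```

Because $w$ is a weighted quadratic, the two quantities coincide exactly, so `gap` is zero up to floating-point rounding.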







Next, observe that Algorithm 1 works as follows. Each iteration integrates a mirror descent step and a
gradient descent step. The standard analysis of mirror descent gives a convergence of $1/\epsilon^2$, and it depends on
the width of the problem. Thus, to get a width-independent $\tilde{O}(1/\epsilon)$ solver, we need to show that Algorithm 1
addresses both of these issues.

• In order to eliminate the width from the convergence rate, the gradient $\nabla_i f_\mu(x_k)$ is split into the small
component, $\xi_k^{(i)} = \max\{-1, \nabla_i f_\mu(x_k)\}\, e_i$, and the large component, $\eta_k^{(i)} = \nabla_i f_\mu(x_k)\, e_i - \xi_k^{(i)}$. Only the
small component is given to the mirror descent step, and thus the width is effectively 1. However, the
truncation incurs loss from the large component, as the mirror descent only acts on the small component.
Following [2], the improvement from the gradient descent step is used to cover that loss.

• In order to improve the $1/\epsilon^2$ rate, recall that the $1/\epsilon^2$ in the convergence of mirror descent is largely due to
the regret term accumulated along all iterations of mirror descent. In order to get to $1/\epsilon$, the improvement
from the gradient step also needs to cover the regret from the mirror descent step (see Eqn. (7) below for
the precise formulation of this loss and regret). This enables us to telescope both the loss and the regret
through all iterations and to bound the total by the gap between $f_\mu(x^{start})$ and the optimal. The remaining
terms in the mirror descent also telescope through the algorithm, and they are bounded in total by the
distance (in $A$-norm) from $x^{start}$ to $u^* \in \Delta$.
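
The truncation in the first bullet can be illustrated with a short sketch. This is only an illustration under stated assumptions, not the authors' implementation: the function name `split_gradient_coordinate` is hypothetical, and its input is assumed to be the single partial derivative $\nabla_i f_\mu(x_k)$, which is at most 1 for the smoothed covering objective.

```python
def split_gradient_coordinate(g_i):
    """Split one partial derivative g_i into (small, large) parts.

    small = max(-1, g_i): the part handed to the mirror descent step.
    large = g_i - small:  nonpositive, and nonzero only when g_i < -1.
    The name and interface are hypothetical, chosen for illustration.
    """
    small = max(-1.0, g_i)
    large = g_i - small
    return small, large


# A mild coordinate is passed through unchanged; a steep one is truncated.
print(split_gradient_coordinate(0.3))
print(split_gradient_coordinate(-5.0))
```

By construction the two parts always sum back to the original partial derivative, so nothing is lost; the large part is simply charged to the gradient descent step instead of the mirror descent step.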

Then, given these, all we need is an initial condition $x^{start}$ that is not too far away from the optimal in terms of
the function value and not too far away from $u^*$ in $A$-norm. For packing, starting with all 0's will work. For
covering, we will show later that a good enough $x^{start}$ can be obtained in $\tilde{O}(N)$ time.

Finally, here are some lemmas about the algorithm. The following two lemmas are invariant to the differences
between packing and covering problems, and so they follow directly from the same results in [1] (but, for
completeness, we include the proofs in Appendix A). The values of the parameters $\mu, L, \tau, \alpha_k$ can be found in the
description of Algorithm 1. The first lemma says that the gradient step we take is always valid (i.e., in $\Delta$), which
is crucial in the sense that the gradient descent improvement is proportional to the step length, and we need the
step length to be at least $\frac{1}{n\alpha_k L}$ of the mirror descent step length for the coupling to work.

Lemma 4.8. We have $x_k, y_k, z_k \in \Delta$ for all $k = 0, 1, \ldots, T$.

The second lemma is clearly crucial to achieve the nearly linear time $\tilde{O}(N/\epsilon)$ algorithm.

Lemma 4.9. Each iteration can be implemented in expected $O(N/n)$ time.


4.3 Mirror Descent Step 

We now analyze the mirror descent step of Algorithm 1: 

$$z_k^{(i)} = \arg\min_{z \in \Delta} \{ V_{z_{k-1}}(z) + \langle z, n\alpha_k \xi_k^{(i)} \rangle \}.$$

A lemma of the following form, which here applies to both covering and packing LPs, is needed, and its proof
follows from the textbook mirror descent analysis (or, e.g., Lemma 3.5 in [1]).

Lemma 4.10. $\langle n\alpha_k \xi_k^{(i)}, z_{k-1} - u^* \rangle \le n^2\alpha_k^2 L \langle \xi_k^{(i)}, x_k - y_k^{(i)} \rangle + V_{z_{k-1}}(u^*) - V_{z_k^{(i)}}(u^*)$

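Proof. The lemma follows from the following chain of equalities and inequalities.

$$\langle n\alpha_k \xi_k^{(i)}, z_{k-1} - u^* \rangle = \langle n\alpha_k \xi_k^{(i)}, z_{k-1} - z_k^{(i)} \rangle + \langle n\alpha_k \xi_k^{(i)}, z_k^{(i)} - u^* \rangle$$
$$= n^2\alpha_k^2 L \langle \xi_k^{(i)}, x_k - y_k^{(i)} \rangle + \langle n\alpha_k \xi_k^{(i)}, z_k^{(i)} - u^* \rangle$$
$$\le n^2\alpha_k^2 L \langle \xi_k^{(i)}, x_k - y_k^{(i)} \rangle + \langle -\nabla V_{z_{k-1}}(z_k^{(i)}), z_k^{(i)} - u^* \rangle$$
$$\le n^2\alpha_k^2 L \langle \xi_k^{(i)}, x_k - y_k^{(i)} \rangle + V_{z_{k-1}}(u^*) - V_{z_k^{(i)}}(u^*) - V_{z_{k-1}}(z_k^{(i)})$$
$$\le n^2\alpha_k^2 L \langle \xi_k^{(i)}, x_k - y_k^{(i)} \rangle + V_{z_{k-1}}(u^*) - V_{z_k^{(i)}}(u^*).$$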


The first equality follows by adding and subtracting $z_k^{(i)}$, and the second equality comes from the gradient step
$y_k^{(i)} = x_k + \frac{1}{n\alpha_k L}(z_k^{(i)} - z_{k-1})$. The first inequality is due to the minimality of $z_k^{(i)}$, which gives

$$\langle \nabla V_{z_{k-1}}(z_k^{(i)}) + n\alpha_k \xi_k^{(i)}, u - z_k^{(i)} \rangle \ge 0 \quad \forall u \in \Delta,$$

the second inequality is due to the standard three point property of Bregman divergence, that is, $\forall x, y \ge 0$,

$$\langle -\nabla V_x(y), y - u \rangle = V_x(u) - V_y(u) - V_x(y),$$

and the last inequality just drops the term $-V_{z_{k-1}}(z_k^{(i)})$, which is always nonpositive. □
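
As a sanity check on the three point property used in the proof, the following sketch evaluates both sides numerically for the weighted quadratic divergence $V_x(y) = \frac{1}{2}\sum_i w_i (x_i - y_i)^2$. The weights `w` stand in for $\|A_{:i}\|_\infty$; the function names and the test vectors are hypothetical and chosen only for illustration.

```python
def bregman(w, x, y):
    """V_x(y) = 1/2 * sum_i w_i * (x_i - y_i)^2 for positive weights w."""
    return 0.5 * sum(wi * (xi - yi) ** 2 for wi, xi, yi in zip(w, x, y))


def three_point_gap(w, x, y, u):
    """Return LHS - RHS of <-grad V_x(y), y - u> = V_x(u) - V_y(u) - V_x(y).

    For this divergence the gradient of V_x at y is w_i * (y_i - x_i)
    in coordinate i.
    """
    lhs = sum(-wi * (yi - xi) * (yi - ui)
              for wi, xi, yi, ui in zip(w, x, y, u))
    rhs = bregman(w, x, u) - bregman(w, y, u) - bregman(w, x, y)
    return lhs - rhs


# Arbitrary nonnegative test vectors; the gap should vanish up to rounding.
gap = three_point_gap([1.0, 2.0, 0.5], [0.1, 0.4, 1.0],
                      [0.3, 0.0, 2.0], [1.0, 0.2, 0.7])
print(abs(gap) < 1e-12)
```

The identity holds with equality for every choice of points, which is why the corresponding step in the chain can be read as an equality as well as an inequality.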

Also, we note that the mirror descent step, defined above in a variational way, can be explicitly written as

1. $z_k^{(i)} \leftarrow z_{k-1}$

2. $z_{k,i}^{(i)} \leftarrow z_{k,i}^{(i)} - n\alpha_k \xi_{k,i}^{(i)} / \|A_{:i}\|_\infty$

3. if $z_{k,i}^{(i)} < 0$, $z_{k,i}^{(i)} \leftarrow 0$; if $z_{k,i}^{(i)} > 3/\|A_{:i}\|_\infty$, $z_{k,i}^{(i)} \leftarrow 3/\|A_{:i}\|_\infty$.

This is invariant to the difference of packing and covering, and so it follows directly from Proposition 3.6 in [1].
It is fairly easy to derive, and so we omit the proof.
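
The explicit form of the mirror descent step amounts to a one-coordinate update followed by clipping to the box $[0, 3/\|A_{:i}\|_\infty]$. A minimal sketch is given below; the function name and argument names are hypothetical, `xi_i` is assumed to be the $i$-th coordinate of the small gradient component, and `col_norm_i` is assumed to be $\|A_{:i}\|_\infty > 0$.

```python
def mirror_step_coordinate(z_prev_i, xi_i, n, alpha_k, col_norm_i):
    """One-coordinate mirror descent update with projection onto the box.

    Moves against the truncated gradient coordinate xi_i with step
    n * alpha_k / col_norm_i, then clips to [0, 3 / col_norm_i].
    All names are illustrative assumptions, not the authors' code.
    """
    z_i = z_prev_i - n * alpha_k * xi_i / col_norm_i
    upper = 3.0 / col_norm_i
    return min(max(z_i, 0.0), upper)


# Interior move, clip at the lower bound, and clip at the upper bound.
print(mirror_step_coordinate(0.5, -1.0, 4, 0.1, 2.0))
print(mirror_step_coordinate(0.1, 1.0, 4, 0.1, 2.0))
print(mirror_step_coordinate(1.4, -1.0, 4, 0.1, 2.0))
```

Only coordinate $i$ changes, so the cost of this step is constant per iteration; the remaining per-iteration cost comes from maintaining the products with $A$.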


4.4 Gradient Descent Step 

We now analyze the gradient descent step of Algorithm 1. In particular, from the explicit formulation of the
mirror descent step, we have that $|z_{k,i}^{(i)} - z_{k-1,i}| \le \frac{n\alpha_k |\xi_{k,i}^{(i)}|}{\|A_{:i}\|_\infty}$, which gives

$$|y_{k,i}^{(i)} - x_{k,i}| = \frac{1}{n\alpha_k L}\,|z_{k,i}^{(i)} - z_{k-1,i}| \le \frac{|\xi_{k,i}^{(i)}|}{L\|A_{:i}\|_\infty}.$$

The gradient step we take is within the local region, and so Lemma 4.5 applies. We bound the improvement
from the gradient descent step in the following lemma, which is symmetric (see footnote 4) to Lemma 3.8 in [1].

Lemma 4.11. $f_\mu(x_k) - f_\mu(y_k^{(i)}) \ge \frac{1}{2} \langle \nabla f_\mu(x_k), x_k - y_k^{(i)} \rangle$

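Proof. Since $x_k$ and $y_k^{(i)}$ differ only at coordinate $i$, denote $\gamma = y_{k,i}^{(i)} - x_{k,i}$; we have

$$f_\mu(x_k) - f_\mu(y_k^{(i)}) = f_\mu(x_k) - f_\mu(x_k + \gamma e_i) = \int_0^\gamma -\nabla_i f_\mu(x_k + \nu e_i)\, d\nu.$$

Since $\gamma$ satisfies $|\gamma| \le \frac{|\xi_{k,i}^{(i)}|}{L\|A_{:i}\|_\infty} \le \frac{1}{L\|A_{:i}\|_\infty}$, we can apply Lemma 4.5. There are two cases to consider.

If $\nabla_i f_\mu(x_k) \in (-1, 1)$, then we have $|\gamma| \le \frac{|\xi_{k,i}^{(i)}|}{L\|A_{:i}\|_\infty} = \frac{|\nabla_i f_\mu(x_k)|}{L\|A_{:i}\|_\infty}$, and by Lemma 4.5 we have
$-\nabla_i f_\mu(x_k + \nu e_i) \ge -\nabla_i f_\mu(x_k) - L\|A_{:i}\|_\infty |\nu|$ in the above integration. Thus,

$$f_\mu(x_k) - f_\mu(y_k^{(i)}) = \int_0^\gamma -\nabla_i f_\mu(x_k + \nu e_i)\, d\nu$$
$$\ge \int_0^\gamma \left( -\nabla_i f_\mu(x_k) - L\|A_{:i}\|_\infty |\nu| \right) d\nu$$
$$= -\nabla_i f_\mu(x_k)\,\gamma - \frac{L\|A_{:i}\|_\infty}{2}\gamma^2$$
$$\ge -\frac{1}{2} \langle \nabla_i f_\mu(x_k), \gamma \rangle = \frac{1}{2} \langle \nabla f_\mu(x_k), x_k - y_k^{(i)} \rangle.$$

If $\nabla_i f_\mu(x_k) \le -1$, then again by Lemma 4.5 we have $-\nabla_i f_\mu(x_k + \nu e_i) \ge -\left(1 - \frac{L\|A_{:i}\|_\infty}{2}|\nu|\right) \nabla_i f_\mu(x_k) \ge -\frac{1}{2}\nabla_i f_\mu(x_k)$. Thus,

$$f_\mu(x_k) - f_\mu(y_k^{(i)}) = \int_0^\gamma -\nabla_i f_\mu(x_k + \nu e_i)\, d\nu \ge \int_0^\gamma -\frac{1}{2}\nabla_i f_\mu(x_k)\, d\nu = \frac{1}{2} \langle \nabla f_\mu(x_k), x_k - y_k^{(i)} \rangle.$$

□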

4 The symmetry is between Lemma 2.6 in [1] and Lemma 4.5, as the gradient descent improvement follows directly from the corresponding
Lipschitz properties. The actual improvement guarantee is the same as Lemma 3.8 in [1].


4.5 Coupling of Gradient and Mirror Descent

Here, we will analyze the coupling between the gradient descent and mirror descent steps. This and the next 
section will give a proof of Theorem 4.6. 

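As we take steps on random coordinates, we will write the full gradient as

$$\nabla f_\mu(x_k) = E_i[n \nabla_i f_\mu(x_k)\, e_i] = E_i[n\eta_k^{(i)} + n\xi_k^{(i)}].$$

As discussed earlier, we have the small component $\xi_k^{(i)} \in [-1, 1]\, e_i$ and the large component $\eta_k^{(i)} = \nabla_i f_\mu(x_k)\, e_i - \xi_k^{(i)} \in (-\infty, 0]\, e_i$.
We put the gradient and mirror descent steps together, and we bound the gap to optimality at
iteration $k$:

$$\alpha_k (f_\mu(x_k) - f_\mu(u^*)) \le \langle \alpha_k \nabla f_\mu(x_k), x_k - u^* \rangle$$
$$= \langle \alpha_k \nabla f_\mu(x_k), x_k - z_{k-1} \rangle + \langle \alpha_k \nabla f_\mu(x_k), z_{k-1} - u^* \rangle$$
$$= \langle \alpha_k \nabla f_\mu(x_k), x_k - z_{k-1} \rangle + E_i[\langle n\alpha_k \eta_k^{(i)}, z_{k-1} - u^* \rangle + \langle n\alpha_k \xi_k^{(i)}, z_{k-1} - u^* \rangle]$$
$$= \frac{1-\tau}{\tau}\alpha_k \langle \nabla f_\mu(x_k), y_{k-1} - x_k \rangle + E_i[\langle n\alpha_k \eta_k^{(i)}, z_{k-1} - u^* \rangle] + E_i[\langle n\alpha_k \xi_k^{(i)}, z_{k-1} - u^* \rangle]$$
$$\le \frac{1-\tau}{\tau}\alpha_k (f_\mu(y_{k-1}) - f_\mu(x_k)) + E_i[\langle n\alpha_k \eta_k^{(i)}, z_{k-1} - u^* \rangle] + E_i[n^2\alpha_k^2 L \langle \xi_k^{(i)}, x_k - y_k^{(i)} \rangle + V_{z_{k-1}}(u^*) - V_{z_k^{(i)}}(u^*)].$$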

The first line is due to convexity. The next two lines just break and regroup the terms. The fourth line is due to
$x_k = \tau z_{k-1} + (1 - \tau) y_{k-1}$, so $\tau(x_k - z_{k-1}) = (1 - \tau)(y_{k-1} - x_k)$. The last line is by convexity and Lemma 4.10.

We try to use the improvement from the gradient step given in Lemma 4.11 to cover the loss from $\eta_k^{(i)}$, and
the regret from the mirror descent step:

$$\underbrace{E_i[\langle n\alpha_k \eta_k^{(i)}, z_{k-1} - u^* \rangle]}_{\text{loss from } \eta_k^{(i)}} + \underbrace{E_i[n^2\alpha_k^2 L \langle \xi_k^{(i)}, x_k - y_k^{(i)} \rangle]}_{\text{regret from mirror descent}}, \qquad (7)$$

and we will use the fact that $z_{k-1}, z_k^{(i)}, u^* \in \Delta$. Consider the following cases.

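1. $\eta_k^{(i)} = 0$: In this case, the loss term is 0. We only need to worry about the regret term, and by Lemma 4.11

$$n^2\alpha_k^2 L \langle \xi_k^{(i)}, x_k - y_k^{(i)} \rangle \le 2n^2\alpha_k^2 L (f_\mu(x_k) - f_\mu(y_k^{(i)})).$$

2. $\eta_k^{(i)} < 0$, $z_{k,i}^{(i)} < \frac{3}{\|A_{:i}\|_\infty}$: In this case, we increased the $i$-th variable in both the gradient and mirror descent
steps, and because $z_k^{(i)}$ is inside $\Delta$ without any projection, we know the step length of gradient descent is
exactly $y_{k,i}^{(i)} - x_{k,i} = \frac{1}{n\alpha_k L} \cdot \frac{n\alpha_k}{\|A_{:i}\|_\infty} = \frac{1}{L\|A_{:i}\|_\infty}$. Together with $z_{k-1} \ge 0$ and $u_i^* \le \frac{3}{\|A_{:i}\|_\infty}$, we have

$$\langle n\alpha_k \eta_k^{(i)}, z_{k-1} - u^* \rangle \le \langle n\alpha_k \eta_k^{(i)}, -u^* \rangle \le -n\alpha_k \nabla_i f_\mu(x_k) \frac{3}{\|A_{:i}\|_\infty} = 3n\alpha_k L \langle \nabla f_\mu(x_k), x_k - y_k^{(i)} \rangle,$$

and

$$\langle n\alpha_k \eta_k^{(i)}, z_{k-1} - u^* \rangle + n^2\alpha_k^2 L \langle \xi_k^{(i)}, x_k - y_k^{(i)} \rangle \le (3n\alpha_k L + n^2\alpha_k^2 L) \langle \nabla f_\mu(x_k), x_k - y_k^{(i)} \rangle$$
$$\le (6n\alpha_k L + 2n^2\alpha_k^2 L)(f_\mu(x_k) - f_\mu(y_k^{(i)})).$$

The last step is by Lemma 4.11.

3. $\eta_k^{(i)} < 0$, $z_{k,i}^{(i)} = \frac{3}{\|A_{:i}\|_\infty}$: In this case, as we know $u_i^* \le \frac{3}{\|A_{:i}\|_\infty} = z_{k,i}^{(i)}$, we have

$$\langle n\alpha_k \eta_k^{(i)}, z_{k-1} - u^* \rangle \le \langle n\alpha_k \eta_k^{(i)}, z_{k-1} - z_k^{(i)} \rangle = n^2\alpha_k^2 L \langle \eta_k^{(i)}, x_k - y_k^{(i)} \rangle,$$

and

$$\langle n\alpha_k \eta_k^{(i)}, z_{k-1} - u^* \rangle + n^2\alpha_k^2 L \langle \xi_k^{(i)}, x_k - y_k^{(i)} \rangle \le 2n^2\alpha_k^2 L \langle \nabla f_\mu(x_k), x_k - y_k^{(i)} \rangle$$
$$\le 4n^2\alpha_k^2 L (f_\mu(x_k) - f_\mu(y_k^{(i)})).$$

Again, the last step is due to Lemma 4.11.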

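Since $n\alpha_k < 1$ for all $k$, we have in all of the above cases,

$$E_i[\langle n\alpha_k \eta_k^{(i)}, z_{k-1} - u^* \rangle] + E_i[n^2\alpha_k^2 L \langle \xi_k^{(i)}, x_k - y_k^{(i)} \rangle] \le E_i[8n\alpha_k L (f_\mu(x_k) - f_\mu(y_k^{(i)}))].$$

Back to our earlier derivation, we have

$$\alpha_k (f_\mu(x_k) - f_\mu(u^*)) \le \frac{1-\tau}{\tau}\alpha_k (f_\mu(y_{k-1}) - f_\mu(x_k)) + E_i[\langle n\alpha_k \eta_k^{(i)}, z_{k-1} - u^* \rangle]$$
$$\qquad + E_i[n^2\alpha_k^2 L \langle \xi_k^{(i)}, x_k - y_k^{(i)} \rangle + V_{z_{k-1}}(u^*) - V_{z_k^{(i)}}(u^*)]$$
$$\le \frac{1-\tau}{\tau}\alpha_k (f_\mu(y_{k-1}) - f_\mu(x_k)) + E_i[8n\alpha_k L (f_\mu(x_k) - f_\mu(y_k^{(i)}))]$$
$$\qquad + E_i[V_{z_{k-1}}(u^*) - V_{z_k^{(i)}}(u^*)].$$

With our choice of $\tau = \frac{1}{8nL}$, $\alpha_k = \frac{1}{1-\tau}\alpha_{k-1}$, we have

$$-\alpha_k f_\mu(u^*) \le 8nL\alpha_{k-1} f_\mu(y_{k-1}) - E_i[8nL\alpha_k f_\mu(y_k^{(i)})] + E_i[V_{z_{k-1}}(u^*) - V_{z_k^{(i)}}(u^*)].$$

Telescoping the above inequality along $k = 1, \ldots, T$, we get

$$E_i[8nL\alpha_T f_\mu(y_T)] \le \sum_{k=1}^{T} \alpha_k f_\mu(u^*) + 8nL\alpha_0 f_\mu(y_0) + V_{z_0}(u^*),$$

and thus, dividing by $8nL\alpha_T$,

$$E_i[f_\mu(y_T)] \le \frac{\sum_{k=1}^{T}\alpha_k}{8nL\alpha_T} f_\mu(u^*) + \frac{\alpha_0}{\alpha_T} f_\mu(y_0) + \frac{1}{8nL\alpha_T} V_{z_0}(u^*).$$

We have $\sum_{k=1}^{T} \alpha_k = \alpha_T \sum_{k=0}^{T-1} (1 - \frac{1}{8nL})^k = 8nL\alpha_T (1 - (1 - \frac{1}{8nL})^T) \le 8nL\alpha_T$, and by our choice of
$T = \lceil 8nL \log(1/\epsilon) \rceil$, we also have

$$\frac{\alpha_0}{\alpha_T} = \left(1 - \frac{1}{8nL}\right)^T \le \epsilon, \qquad \frac{1}{8nL\alpha_T} \le \frac{\epsilon}{8nL\alpha_0},$$

and thus

$$E_i[f_\mu(y_T)] \le f_\mu(u^*) + \epsilon f_\mu(y_0) + \frac{\epsilon}{8nL\alpha_0} V_{z_0}(u^*).$$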
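
The two facts about the step sizes used in the telescoping argument can be checked numerically. The sketch below builds the geometric schedule $\alpha_k = \alpha_{k-1}/(1-\tau)$ and returns it for inspection. It is an illustration only: the function name is hypothetical, the sample values of `n`, `L`, and `eps` are arbitrary, and the initial value $\alpha_0 = 1/(nL)$ is an assumption made here about the setting in Algorithm 1.

```python
import math


def step_schedule(n, L, eps):
    """Return (tau, T, alphas) with alphas[k] = alpha_k for k = 0..T.

    Uses tau = 1/(8nL), T = ceil(8nL log(1/eps)) and
    alpha_k = alpha_{k-1} / (1 - tau).
    alpha_0 = 1/(nL) is an assumed initial value.
    """
    tau = 1.0 / (8 * n * L)
    T = math.ceil(8 * n * L * math.log(1.0 / eps))
    alphas = [1.0 / (n * L)]
    for _ in range(T):
        alphas.append(alphas[-1] / (1.0 - tau))
    return tau, T, alphas


tau, T, alphas = step_schedule(n=3, L=5.0, eps=0.1)
# The sum of alpha_1..alpha_T is at most 8nL * alpha_T.
print(sum(alphas[1:]) <= 8 * 3 * 5.0 * alphas[-1])
# The ratio alpha_0 / alpha_T is at most eps.
print(alphas[0] / alphas[-1] <= 0.1)
```

The first check corresponds to the weight on $f_\mu(u^*)$ being at most 1, and the second to the weight on $f_\mu(y_0)$ being at most $\epsilon$.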


4.6 Finding a Good Starting Point 

Here, we will describe how to find a good starting point for the algorithm. This will permit us to establish the
quality-of-approximation and running time guarantees of Theorem 4.6.

A good starting point $y_0 = x^{start}$ for Algorithm 1 is an initial condition $x^{start}$ that is not too far away from
the optimal in terms of the function value (i.e., small $f_\mu(y_0)$), and not too far away from $u^*$ in $A$-norm (i.e., small
$V_{z_0}(u^*)$). For packing problems, starting with the all-0's vector will work, but this will not work for covering
problems. Instead, for covering problems, we will show now that a good enough $x^{start}$ can be obtained in $\tilde{O}(N)$ time.

To do so, recall that we can get a 2-approximation $x^\#$ to the original covering LP in time $\tilde{O}(N)$ using
various nearly linear time covering solvers, e.g., those of [5, 13]. Without loss of generality, we can assume
$x_i^\# \in [0, \frac{2}{\|A_{:i}\|_\infty}]$, since we can use the diameter reduction process as specified in Lemma 3.1 to get an equivalent
solution satisfying the conditions. Then, we have the following lemma.

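Lemma 4.12. Let $x^{start} = (1 + \epsilon/2)x^\#$. Then $x^{start} \in \Delta$, $f_\mu(x^{start}) \le 4\,\mathrm{OPT}$, and $V_{x^{start}}(u^*) \le 6\,\mathrm{OPT}$.

Proof. It is obvious that $x^{start} \in \Delta$. Also,

$$\mathbb{1}^T x^{start} = (1 + \epsilon/2)\mathbb{1}^T x^\# \le (1 + \epsilon/2)\, 2\,\mathrm{OPT} \le 3\,\mathrm{OPT}.$$

Furthermore, we have $Ax^{start} - \mathbb{1} \ge (1 + \epsilon/2)Ax^\# - \mathbb{1} \ge \frac{\epsilon}{2}\mathbb{1}$, and so

$$f_\mu(x^{start}) = \mu \sum_j p_j(x^{start}) + \mathbb{1}^T x^{start} \le \mu \sum_j \exp\left(-\frac{\epsilon}{2\mu}\right) + 3\,\mathrm{OPT} \le \mu m \left(\frac{\epsilon}{nm}\right)^2 + 3\,\mathrm{OPT} \le 4\,\mathrm{OPT}.$$

For the divergence, we have that

$$V_{x^{start}}(u^*) = \frac{1}{2}\sum_i \|A_{:i}\|_\infty (x_i^{start} - u_i^*)^2$$
$$= \frac{1}{2}\sum_i \|A_{:i}\|_\infty \left( (x_i^{start})^2 + (u_i^*)^2 - 2x_i^{start}u_i^* \right)$$
$$\le \frac{3}{2}\sum_i \left( x_i^{start} + u_i^* \right)$$
$$\le \frac{3}{2}(3\,\mathrm{OPT} + \mathrm{OPT}) \le 6\,\mathrm{OPT},$$

which proves the lemma. □

It is now clear that we have

$$E_i[f_\mu(y_T)] \le f_\mu(u^*) + \epsilon f_\mu(y_0) + \frac{\epsilon}{8nL\alpha_0} V_{z_0}(u^*) \le (1 + \epsilon)\,\mathrm{OPT} + 4\epsilon\,\mathrm{OPT} + \epsilon\,\mathrm{OPT} = (1 + 6\epsilon)\,\mathrm{OPT}.$$

Thus, we have the approximation guarantee in Theorem 4.6. The running time follows directly from Lemma 4.9
and $T = \tilde{O}(n/\epsilon)$.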

Acknowledgments. DW was supported by ARO Grant W911NF-12-1-0541, SR was funded by NSF Grant
CCF-1118083, and MM acknowledges the support of the NSF, AFOSR, and DARPA.

Appendix A Missing Proofs 

The following proofs can be found in [1], and we include them here for completeness. 

Lemma 4.1. $\mathrm{OPT} \in [1, m]$

Proof. By the assumption $\min_{j \in [m]} \|A_{j:}\|_\infty = 1$, we know at least one constraint has all coefficients at most 1,
so to satisfy that constraint, we must have the sum of the variables to be at least 1. On the other hand, since each
constraint has a variable with coefficient at least 1 in it, $x = \mathbb{1}$ clearly satisfies all constraints, so $\mathrm{OPT} \le m$. □

Lemma 4.4. Setting the smoothing parameter $\mu = \frac{\epsilon}{4\log(nm/\epsilon)}$, we have

1. $f_\mu(u^*) \le (1 + \epsilon)\,\mathrm{OPT}$.

2. $f_\mu(x) \ge (1 - \epsilon)\,\mathrm{OPT}$ for any $x \ge 0$.

3. For any $x \ge 0$ satisfying $f_\mu(x) \le 2\,\mathrm{OPT}$, we must have $Ax \ge (1 - \epsilon)\mathbb{1}$.

4. If $x \ge 0$ satisfies $f_\mu(x) \le (1 + O(\epsilon))\,\mathrm{OPT}$, then $\frac{1}{1-\epsilon}x$ is a $(1 + O(\epsilon))$-approximation to the covering LP.

5. The gradient of $f_\mu(x)$ is

$$\nabla f_\mu(x) = \mathbb{1} - A^T p(x) \quad \text{where } p_j(x) = \exp\left(\frac{1}{\mu}(1 - (Ax)_j)\right),$$

and $\nabla_i f_\mu(x) = 1 - \sum_j A_{ji}\, p_j(x) \in (-\infty, 1]$.

Proof. 1. Since $Ax^* \ge \mathbb{1}$, and $u^* = (1 + \epsilon/2)x^*$, we have $(Au^*)_j - 1 \ge \epsilon/2$ for all $j$. Then $p_j(u^*) \le
\exp(-\frac{\epsilon}{2\mu}) = (\frac{\epsilon}{nm})^2$, and $f_\mu(u^*) = \mathbb{1}^T u^* + \mu \sum_j p_j(u^*) \le (1 + \epsilon/2)\,\mathrm{OPT} + \mu m (\frac{\epsilon}{nm})^2 \le (1 + \epsilon)\,\mathrm{OPT}$.

2. By contradiction, suppose $f_\mu(x) < (1 - \epsilon)\,\mathrm{OPT}$. Since $f_\mu(x) < \mathrm{OPT} \le m$, we must have $p_j(x) < m/\mu$
for any $j$, which implies $(Ax)_j \ge 1 - \epsilon$. By definition of OPT, we have $\mathbb{1}^T x \ge (1 - \epsilon)\,\mathrm{OPT}$, since
$Ax \ge (1 - \epsilon)\mathbb{1}$. This gives a contradiction as $f_\mu(x) \ge \mathbb{1}^T x \ge (1 - \epsilon)\,\mathrm{OPT}$.

3. By contradiction, suppose there is some $j$ such that $(Ax)_j - 1 < -\epsilon$. Then, as in the last part, we have
$\mu p_j(x) > \mu (\frac{nm}{\epsilon})^4 > 2\,\mathrm{OPT}$, contradicting $f_\mu(x) \le 2\,\mathrm{OPT}$.

4. For any $x$ satisfying $f_\mu(x) \le (1 + O(\epsilon))\,\mathrm{OPT} \le 2\,\mathrm{OPT}$, by the last part we know $Ax \ge (1 - \epsilon)\mathbb{1}$, so
$A(\frac{1}{1-\epsilon}x) \ge \mathbb{1}$. We also have $\mathbb{1}^T(\frac{1}{1-\epsilon}x) \le (1 + O(\epsilon))\,\mathrm{OPT}$.

5. This is by straightforward computation.

□
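
To make part 5 of Lemma 4.4 concrete, the following sketch evaluates the smoothed covering objective $f_\mu(x) = \mathbb{1}^T x + \mu \sum_j p_j(x)$ and its gradient on a dense matrix. It is a minimal illustration, not the coordinate-wise implementation analyzed in Lemma 4.9: the function name is hypothetical and `A` is assumed to be given as a list of rows.

```python
import math


def smoothed_covering(A, x, mu):
    """Return (f_mu(x), gradient) for the smoothed covering objective.

    p_j(x) = exp((1 - (Ax)_j) / mu), f_mu(x) = sum(x) + mu * sum_j p_j(x),
    and the i-th partial derivative is 1 - sum_j A[j][i] * p_j(x).
    A is a dense list of m rows of length n (illustrative assumption).
    """
    m, n = len(A), len(x)
    Ax = [sum(A[j][i] * x[i] for i in range(n)) for j in range(m)]
    p = [math.exp((1.0 - Ax[j]) / mu) for j in range(m)]
    value = sum(x) + mu * sum(p)
    grad = [1.0 - sum(A[j][i] * p[j] for j in range(m)) for i in range(n)]
    return value, grad


# With A = I and x = (1, 1) every constraint is tight, so p_j = 1.
value, grad = smoothed_covering([[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0], 0.5)
print(value, grad)
```

Because $A$ and $p(x)$ are nonnegative, every partial derivative computed this way is at most 1, matching the range $(-\infty, 1]$ stated in the lemma.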


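Lemma 4.8. We have $x_k, y_k, z_k \in \Delta$ for all $k = 0, 1, \ldots, T$.

Proof. At the start $x_0 = y_0 = z_0 = x^{start} \in \Delta$ by assumption. $z_k$ is always in $\Delta$ as we take the projection in
the mirror descent step. If we can further show $y_k \in \Delta$ for all $k$, we are done, since $x_k$ is a convex combination
of $y_{k-1}, z_{k-1}$. To show $y_k \in \Delta$, we write $y_k$ as a convex combination of $z_0, \ldots, z_k$, $y_k = \sum_{l=0}^{k} c_k^l z_l$. At $k = 0$,
we have $y_0 = z_0$, and at $k = 1$, $y_1 = x_1 + \frac{1}{n\alpha_1 L}(z_1 - z_0) = \frac{1}{n\alpha_1 L} z_1 + (1 - \frac{1}{n\alpha_1 L}) z_0$, as $x_1 = y_0 = z_0$. For
$k \ge 2$, we can verify

$$c_k^l = \begin{cases} (1 - \tau)\, c_{k-1}^l & l = 0, \ldots, k-2 \\ \left( \frac{1}{n\alpha_{k-1}L} - \frac{1}{n\alpha_k L} \right) + \tau \left( 1 - \frac{1}{n\alpha_{k-1}L} \right) & l = k-1 \\ \frac{1}{n\alpha_k L} & l = k \end{cases}$$

since

$$y_k = x_k + \frac{1}{n\alpha_k L}(z_k - z_{k-1})$$
$$= \tau z_{k-1} + (1 - \tau) y_{k-1} + \frac{1}{n\alpha_k L}(z_k - z_{k-1})$$
$$= \tau z_{k-1} + (1 - \tau)\left( \sum_{l=0}^{k-2} c_{k-1}^l z_l + \frac{1}{n\alpha_{k-1}L} z_{k-1} \right) + \frac{1}{n\alpha_k L}(z_k - z_{k-1})$$
$$= \left( \sum_{l=0}^{k-2} (1 - \tau)\, c_{k-1}^l z_l \right) + \left( \left( \frac{1}{n\alpha_{k-1}L} - \frac{1}{n\alpha_k L} \right) + \tau \left( 1 - \frac{1}{n\alpha_{k-1}L} \right) \right) z_{k-1} + \frac{1}{n\alpha_k L} z_k.$$

As $\alpha_k \ge \alpha_{k-1}$, and $\alpha_0 = \frac{1}{nL}$, we have $c_k^l \ge 0$ for all $k, l$, and it is easy to check the coefficients sum to 1 for
each $k$.

□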
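
The convex-combination claim in this proof can also be checked numerically by unrolling the update $y_k = \tau z_{k-1} + (1-\tau) y_{k-1} + \frac{1}{n\alpha_k L}(z_k - z_{k-1})$ on the coefficient vectors directly. The sketch below is an illustration under assumptions: the function name is hypothetical, the sample parameters are arbitrary, and $\alpha_0 = 1/(nL)$ is taken as the initial step size.

```python
def convex_coefficients(n, L, tau, T):
    """Return history[k] = coefficients of y_k over z_0..z_k, for k = 0..T.

    Applies y_k = tau*z_{k-1} + (1-tau)*y_{k-1} + step*(z_k - z_{k-1})
    with step = 1/(n*alpha_k*L) and alpha_k = alpha_{k-1}/(1-tau).
    alpha_0 = 1/(nL) is an assumed initial value.
    """
    alpha = 1.0 / (n * L)
    coeffs = [1.0]                      # y_0 = z_0
    history = [coeffs]
    for k in range(1, T + 1):
        alpha = alpha / (1.0 - tau)     # alpha_k
        step = 1.0 / (n * alpha * L)
        new = [(1.0 - tau) * c for c in coeffs]
        new[k - 1] += tau - step        # z_{k-1} gains tau, loses step
        new.append(step)                # weight on the new point z_k
        coeffs = new
        history.append(coeffs)
    return history


history = convex_coefficients(n=3, L=5.0, tau=1.0 / 120, T=20)
print(all(min(c) >= 0.0 for c in history))
print(all(abs(sum(c) - 1.0) < 1e-9 for c in history))
```

Since each $z_l$ lies in the convex set $\Delta$, nonnegative weights summing to 1 are exactly what is needed for $y_k \in \Delta$.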


Lemma 4.9. Each iteration can be implemented in expected $O(N/n)$ time.

Proof. We show how to implement an iteration conditioned on $i$ in time $O(\|A_{:i}\|_0)$, where $\|A_{:i}\|_0$ is the number
of non-zeros in column $i$, thus giving an expected running time of $O(N/n)$ for each iteration. We maintain the
following quantities

$$z_k \in \mathbb{R}^n_{\ge 0},\ az_k \in \mathbb{R}^m_{\ge 0},\ y'_k \in \mathbb{R}^n,\ ay_k \in \mathbb{R}^m,\ B_{k,1}, B_{k,2} \in \mathbb{R}_+$$

with the following invariants always satisfied throughout the algorithm

$$Az_k = az_k \qquad (8)$$

$$y_k = B_{k,1} z_k + B_{k,2} y'_k, \qquad Ay_k = B_{k,1}\, az_k + B_{k,2}\, ay_k \qquad (9)$$

When $k = 0$, we let $az_0 = Az_0$, $y'_0 = y_0$, $ay_0 = Ay_0$, $B_{0,1} = 0$, $B_{0,2} = 1$, and it is clear all the invariants are
satisfied. For $k = 1, 2, \ldots, T$:

• The step $x_k = \tau z_{k-1} + (1 - \tau) y_{k-1}$ does not need to be implemented.

• Computation of $\nabla_i f(x_k)$ requires the value of $p_j(x_k) = \exp(\frac{1}{\mu}(1 - (Ax_k)_j))$ for each $j$ such that $A_{ji} \ne 0$,
and we can get the value

$$(Ax_k)_j = \tau (Az_{k-1})_j + (1 - \tau)(Ay_{k-1})_j = (\tau + (1 - \tau)B_{k-1,1})(az_{k-1})_j + (1 - \tau)B_{k-1,2}\,(ay_{k-1})_j$$

for each such $j$. This can be computed in $O(1)$ time for each $j$, and $O(\|A_{:i}\|_0)$ time in total.

• The mirror descent step $z_k^{(i)} = \arg\min_{z \in \Delta}\{ V_{z_{k-1}}(z) + \langle z, n\alpha_k \xi_k^{(i)} \rangle \}$ is simply $z_k = z_{k-1} + \delta e_i$, where
$\delta \in \mathbb{R}$ can be computed in $O(1)$ time. $z_k = z_{k-1} + \delta e_i$ yields $y_k = \tau z_{k-1} + (1 - \tau) y_{k-1} + \frac{\delta}{n\alpha_k L} e_i$ by
the gradient descent step. Therefore, we can update the values accordingly

$$z_k \leftarrow z_{k-1} + \delta e_i, \qquad az_k \leftarrow az_{k-1} + \delta A_{:i}$$

and

$$B_{k,1} \leftarrow \tau + (1 - \tau)B_{k-1,1}, \qquad B_{k,2} \leftarrow (1 - \tau)B_{k-1,2}$$

$$y'_k \leftarrow y'_{k-1} + \delta\left( -\frac{B_{k,1}}{B_{k,2}} + \frac{1}{n\alpha_k L}\frac{1}{B_{k,2}} \right) e_i, \qquad ay_k \leftarrow ay_{k-1} + \delta\left( -\frac{B_{k,1}}{B_{k,2}} + \frac{1}{n\alpha_k L}\frac{1}{B_{k,2}} \right) A_{:i}$$

We can verify that after the updates, the invariants still hold

$$y_k = B_{k,1} z_k + B_{k,2} y'_k = B_{k,1}(z_{k-1} + \delta e_i) + B_{k,2}\left( y'_{k-1} + \delta\left( -\frac{B_{k,1}}{B_{k,2}} + \frac{1}{n\alpha_k L}\frac{1}{B_{k,2}} \right) e_i \right)$$
$$= B_{k,1} z_{k-1} + B_{k,2}\left( y'_{k-1} + \delta \frac{1}{n\alpha_k L}\frac{1}{B_{k,2}} e_i \right)$$
$$= B_{k,1} z_{k-1} + B_{k,2}\, y'_{k-1} + \frac{\delta}{n\alpha_k L} e_i$$
$$= (\tau + (1 - \tau)B_{k-1,1}) z_{k-1} + ((1 - \tau)B_{k-1,2})\, y'_{k-1} + \frac{\delta}{n\alpha_k L} e_i$$
$$= \tau z_{k-1} + (1 - \tau) y_{k-1} + \frac{\delta}{n\alpha_k L} e_i.$$

It is also straightforward to verify that $Ay_k = B_{k,1}\, az_k + B_{k,2}\, ay_k$ equals $Ay_k = \tau Az_{k-1} + (1 - \tau)Ay_{k-1} +
\frac{\delta}{n\alpha_k L} A e_i$. The updates are dominated by the updates on $az_k$ and $ay_k$, which take $O(\|A_{:i}\|_0)$ time.

□
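
The bookkeeping in this proof can be sketched as follows. This is a hedged illustration rather than the authors' code: the names `init_state`, `lazy_update`, `current_y`, and `demo` are hypothetical, `step` stands for $\frac{1}{n\alpha_k L}$ and is held fixed for simplicity, and the increments `delta` are made-up numbers instead of the output of a real mirror descent step. The demo compares the lazily maintained $y_k$ and $Ay_k$ against the explicit recurrence.

```python
def init_state(A, z0, y0):
    """Maintained quantities at k = 0, with B_{0,1} = 0 and B_{0,2} = 1."""
    matvec = lambda v: [sum(a * vi for a, vi in zip(row, v)) for row in A]
    return {"z": list(z0), "az": matvec(z0), "yp": list(y0),
            "ay": matvec(y0), "B1": 0.0, "B2": 1.0}


def lazy_update(state, col_i, i, delta, tau, step):
    """One iteration touching only coordinate i and the nonzeros of column i.

    col_i is a list of (row index j, A[j][i]) pairs for nonzero entries.
    """
    B1 = tau + (1.0 - tau) * state["B1"]
    B2 = (1.0 - tau) * state["B2"]
    shift = delta * (-B1 / B2 + step / B2)
    state["z"][i] += delta
    state["yp"][i] += shift
    for j, a in col_i:
        state["az"][j] += delta * a
        state["ay"][j] += shift * a
    state["B1"], state["B2"] = B1, B2


def current_y(state):
    """Recover y_k = B1 * z_k + B2 * y'_k (only needed for checking)."""
    return [state["B1"] * zi + state["B2"] * ypi
            for zi, ypi in zip(state["z"], state["yp"])]


def demo():
    """Return the largest deviation from the explicit recurrence."""
    A = [[1.0, 2.0], [0.0, 1.0]]
    z, y = [0.5, 0.25], [0.5, 0.25]
    state = init_state(A, z, y)
    tau, step, worst = 0.1, 0.8, 0.0
    for i, delta in [(0, 0.2), (1, -0.1), (0, 0.05)]:
        col_i = [(j, A[j][i]) for j in range(len(A)) if A[j][i] != 0.0]
        # Explicit reference: y_k = tau*z_{k-1} + (1-tau)*y_{k-1} + step*delta*e_i
        y = [tau * zi + (1.0 - tau) * yi for zi, yi in zip(z, y)]
        y[i] += step * delta
        z[i] += delta
        lazy_update(state, col_i, i, delta, tau, step)
        Ay = [sum(a * yi for a, yi in zip(row, y)) for row in A]
        lazy_Ay = [state["B1"] * p + state["B2"] * q
                   for p, q in zip(state["az"], state["ay"])]
        errs = [abs(a - b) for a, b in zip(current_y(state), y)]
        errs += [abs(a - b) for a, b in zip(lazy_Ay, Ay)]
        worst = max(worst, max(errs))
    return worst


print(demo() < 1e-9)
```

The design point is that the global rescaling by $(1-\tau)$ is absorbed into the scalars $B_{k,1}, B_{k,2}$, so no full-length vector is ever rewritten inside an iteration.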